Introduction

IMDb is the world’s most popular online database of ratings and reviews for movies, television series, and other media. As consumers, we want to know what other people think of a movie or show we might be interested in watching, and IMDb is often the go-to destination. The aim of this project is to see which factors help predict the genre of a Netflix movie or show, and whether those factors differ between the two types. Both datasets, one for movies and one for shows, come from Kaggle.

Exploratory Data Analysis

Loading Packages

library(dplyr)
library(ggplot2)
library(tidymodels)
library(tidyverse)
library(naniar)
library(patchwork) # plotting graphs side by side
library(corrplot) # correlation plot
library(ggthemes)
library(kableExtra)
library(glmnet)
library(kknn) # for knn
library(ranger) # for random forest
library(xgboost) # for boost trees
library(yardstick)
library(vip)
tidymodels_prefer()

Loading and Exploring the Data

movies1 <- read.csv("/Users/alainaliu/Downloads/PSTAT 131/Netflix Project/Best Movies Netflix.csv")
shows1 <- read.csv("/Users/alainaliu/Downloads/PSTAT 131/Netflix Project/Best Shows Netflix.csv")
movies1[,-1] %>% kable() %>%
  kable_styling("striped", full_width = FALSE) %>%
  scroll_box(height = "420px")
TITLE RELEASE_YEAR SCORE NUMBER_OF_VOTES DURATION MAIN_GENRE MAIN_PRODUCTION
David Attenborough: A Life on Our Planet 2020 9.0 31180 83 documentary GB
Inception 2010 8.8 2268288 148 scifi GB
Forrest Gump 1994 8.8 1994599 142 drama US
Anbe Sivam 2003 8.7 20595 160 comedy IN
Bo Burnham: Inside 2021 8.7 44074 87 comedy US
Saving Private Ryan 1998 8.6 1346020 169 drama US
Django Unchained 2012 8.4 1472668 165 western US
Dangal 2016 8.4 180247 161 action IN
Bo Burnham: Make Happy 2016 8.4 14356 60 comedy US
Louis C.K.: Hilarious 2010 8.4 11973 84 comedy US
Dave Chappelle: Sticks & Stones 2019 8.4 25687 65 comedy US
3 Idiots 2009 8.4 385782 170 comedy IN
Black Friday 2004 8.4 20611 143 crime IN
Super Deluxe 2019 8.4 13680 176 thriller IN
Winter on Fire: Ukraine’s Fight for Freedom 2015 8.3 17710 98 documentary UA
Once Upon a Time in America 1984 8.3 342335 229 drama US
Taxi Driver 1976 8.3 795222 113 crime US
Like Stars on Earth 2007 8.3 188234 165 drama IN
Bo Burnham: What. 2013 8.3 11488 60 comedy US
Full Metal Jacket 1987 8.3 723306 116 drama GB
Warrior 2011 8.2 463276 140 drama US
Drishyam 2015 8.2 79075 163 thriller IN
Queen 2014 8.2 64805 146 drama IN
Paan Singh Tomar 2012 8.2 35888 135 drama IN
Cowspiracy: The Sustainability Secret 2014 8.2 24845 90 documentary US
Virunga 2014 8.2 11403 90 war CD
PK 2014 8.2 178012 153 comedy IN
Bāhubali 2: The Conclusion 2017 8.2 91560 168 fantasy IN
Monty Python and the Holy Grail 1975 8.2 530877 91 comedy GB
Article 15 2019 8.2 32336 130 crime IN
Miracle in Cell No. 7 2019 8.2 46939 132 drama TR
13th 2016 8.2 34914 100 documentary US
Andhadhun 2018 8.2 88359 139 thriller IN
Bill Burr: Paper Tiger 2019 8.1 10649 67 comedy GB
Udaan 2010 8.1 44556 138 drama IN
How to Train Your Dragon 2010 8.1 719717 98 fantasy US
Klaus 2019 8.1 141480 97 comedy ES
Swades 2004 8.1 89085 189 drama IN
Minnal Murali 2021 8.1 24681 158 action IN
Rang De Basanti 2006 8.1 118092 157 comedy IN
Seaspiracy 2021 8.1 29604 80 documentary US
Rush 2013 8.1 465254 123 drama US
Hannah Gadsby: Nanette 2018 8.1 12035 69 comedy AU
Barfi! 2012 8.1 80643 151 drama IN
Haider 2014 8.1 54001 150 drama IN
Zindagi Na Milegi Dobara 2011 8.1 75801 166 comedy IN
A Silent Voice: The Movie 2016 8.1 75132 130 romance JP
OMG: Oh My God! 2012 8.1 57449 125 fantasy IN
Talvar 2015 8.1 34659 132 thriller IN
Into the Wild 2007 8.1 611379 140 drama US
Lagaan: Once Upon a Time in India 2001 8.1 111053 224 drama IN
My Octopus Teacher 2020 8.1 51232 84 documentary ZA
Dil Chahta Hai 2001 8.1 71167 183 drama IN
Mersal 2017 8.1 32573 169 thriller IN
The Legend of Bhagat Singh 2002 8.1 16225 155 drama IN
Stand by Me 1986 8.1 392790 89 drama US
The Exorcist 1973 8.1 391942 133 horror US
Bombay 1995 8.1 12512 141 romance IN
Dasvi 2022 8.0 13140 125 drama IN
G.O.R.A. 2004 8.0 61797 127 scifi TR
Blood Diamond 2006 8.0 536858 143 thriller US
Vizontele 2001 8.0 36291 110 comedy TR
Ip Man 2008 8.0 221095 108 drama HK
Her 2013 8.0 586679 126 drama US
The Bourne Ultimatum 2007 8.0 627009 115 thriller DE
Casino Royale 2006 8.0 644336 139 thriller GB
Special 26 2013 8.0 55489 144 thriller IN
Neon Genesis Evangelion: The End of Evangelion 1997 8.0 51938 87 scifi JP
Bāhubali: The Beginning 2015 8.0 117333 159 drama IN
Ankhon Dekhi 2014 8.0 11330 104 drama IN
Big Fish 2003 8.0 435503 125 drama US
Silenced 2011 8.0 15889 125 drama KR
Life of Brian 1979 8.0 392419 94 comedy GB
The Invisible Guest 2017 8.0 170351 107 thriller ES
Dave Chappelle: The Closer 2021 8.0 24903 72 comedy US
Blade Runner 2049 2017 8.0 539864 164 scifi CA
The Imitation Game 2014 8.0 748654 113 thriller US
Jab We Met 2007 7.9 51945 138 comedy IN
Free to Play 2014 7.9 13308 75 documentary UA
Monty Python Live at the Hollywood Bowl 1982 7.9 15186 77 comedy GB
Dev.D 2009 7.9 30389 144 drama IN
Marriage Story 2019 7.9 290643 136 drama GB
Icarus 2017 7.9 48672 121 documentary US
I Am Not Your Negro 2017 7.9 21632 93 documentary BE
Ricky Gervais: Humanity 2018 7.9 18523 79 comedy GB
Secret Superstar 2017 7.9 24046 150 drama IN
Kal Ho Naa Ho 2003 7.9 68028 186 drama IN
Pad Man 2018 7.9 25269 140 comedy IN
Shyam Singha Roy 2021 7.9 10903 157 drama IN
How to Train Your Dragon 2 2014 7.8 327565 102 fantasy US
The Irishman 2019 7.8 371209 209 drama US
Gangaajal 2003 7.8 17029 157 drama IN
The Girl with the Dragon Tattoo 2011 7.8 454917 158 crime NO
Lakshya 2004 7.8 23076 186 drama IN
The Social Network 2010 7.8 681286 121 drama US
Dunkirk 2017 7.8 619645 107 drama US
Kai Po Che! 2013 7.8 36512 126 drama IN
The Gentlemen 2019 7.8 314049 113 comedy US
Marco Polo: One Hundred Eyes 2015 7.8 10742 28 action US
The Hateful Eight 2015 7.8 570138 188 western US
Hunt for the Wilderpeople 2016 7.8 125720 101 comedy NZ
The Last Samurai 2003 7.8 429097 154 drama NZ
Gattaca 1997 7.8 298168 106 thriller US
The Butterfly’s Dream 2013 7.8 21882 138 drama TR
14 Peaks: Nothing Is Impossible 2021 7.8 22858 101 documentary US
Nightcrawler 2014 7.8 523686 118 crime US
Udta Punjab 2016 7.8 29819 148 crime IN
My Fair Lady 1964 7.8 94121 170 drama US
System Crasher 2019 7.8 12699 118 drama DE
The Game Changers 2019 7.8 19708 88 documentary US
Awakenings 1990 7.8 137549 120 drama US
Badla 2019 7.8 27130 120 thriller IN
Madras Cafe 2013 7.7 24319 130 thriller IN
Beasts of No Nation 2015 7.7 80129 137 war US
Bonnie and Clyde 1967 7.7 111189 110 drama US
Roma 2018 7.7 153508 135 drama MX
Kapoor & Sons 2016 7.7 25792 132 romance IN
Mucize 2015 7.7 12395 136 drama TR
Wind River 2017 7.7 240408 106 thriller FR
Oye Lucky! Lucky Oye! 2008 7.7 17411 126 comedy IN
Dirty Harry 1971 7.7 153463 102 thriller US
Jim & Andy: The Great Beyond - Featuring a Very Special, Contractually Obligated Mention of Tony Clifton 2017 7.7 25593 94 comedy US
Berserk: The Golden Age Arc II - The Battle for Doldrey 2012 7.7 10257 80 fantasy JP
Silver Linings Playbook 2012 7.7 697481 122 drama US
Sanju 2018 7.7 52227 161 drama IN
Argo 2012 7.7 600392 120 drama US
Guru 2007 7.7 23541 166 romance IN
Road to Perdition 2002 7.7 263212 117 thriller US
The Trial of the Chicago 7 2020 7.7 170728 130 drama US
When Harry Met Sally… 1989 7.7 212913 96 romance US
Rock On!! 2008 7.7 21963 144 drama IN
In the Family 2017 7.7 23297 124 comedy TR
Pyaar Ka Punchnama 2011 7.7 21204 149 romance IN
Midnight in Paris 2011 7.7 413541 94 fantasy US
Donnie Brasco 1997 7.7 300073 127 thriller US
The Blind Side 2009 7.6 323939 129 drama US
Sherlock Holmes 2009 7.6 620154 129 crime GB
Stardust 2007 7.6 269043 122 fantasy US
Kuch Kuch Hota Hai 1998 7.6 51640 185 drama IN
Parmanu: The Story of Pokhran 2018 7.6 23771 129 drama IN
Doctor 2021 7.6 14590 150 thriller IN
What Happened, Miss Simone? 2015 7.6 13703 101 musical US
Eddie Murphy Raw 1987 7.6 19646 93 comedy US
Delhi Belly 2011 7.6 29578 102 comedy IN
The Boy Who Harnessed the Wind 2019 7.6 36805 113 drama MW
Kabhi Haan Kabhi Naa 1994 7.6 18224 158 comedy IN
The Two Popes 2019 7.6 120871 125 drama US
The Social Dilemma 2020 7.6 79674 94 drama US
Bad Genius 2017 7.6 20430 130 thriller TH
The Mitchells vs. the Machines 2021 7.6 100787 113 animation US
Gifted Hands: The Ben Carson Story 2009 7.6 10210 86 drama US
Tell Me Who I Am 2019 7.6 14215 85 thriller GB
RBG 2018 7.6 14037 98 documentary US
Highway 2014 7.6 28370 133 drama IN
Athlete A 2020 7.6 10544 104 documentary US
Lupin the Third: The Castle of Cagliostro 1979 7.6 30277 100 comedy JP
Love Actually 2003 7.6 474176 139 drama GB
Hell or High Water 2016 7.6 224900 102 western US
Wake Up Sid 2009 7.6 30818 138 comedy IN
Ludo 2020 7.6 37528 150 crime IN
Stree 2018 7.6 32814 128 horror IN
Aamir 2008 7.6 11241 99 thriller IN
I Am Sam 2001 7.6 149082 132 drama US
True Grit 2010 7.6 333378 110 western US
The Distinguished Citizen 2016 7.5 11495 118 comedy AR
Jodhaa Akbar 2008 7.5 32188 213 romance IN
The Conjuring 2013 7.5 491048 107 thriller US
Dhamaka 2021 7.5 39620 104 thriller IN
Bareilly Ki Barfi 2017 7.5 23011 123 romance IN
Les Misérables 2012 7.5 325132 157 drama GB
Berserk: The Golden Age Arc I - The Egg of the King 2012 7.5 12278 76 fantasy JP
Dil Se.. 1998 7.5 28409 163 drama IN
Omar 2013 7.5 14230 96 thriller PS
I Lost My Body 2019 7.5 31531 81 fantasy FR
tick, tick… BOOM! 2021 7.5 96418 121 drama US
Coming Soon 2014 7.5 33714 134 drama TR
Nocturnal Animals 2016 7.5 264884 115 drama US
A Monster Calls 2016 7.5 86614 108 fantasy ES
On Body and Soul 2017 7.5 27003 116 fantasy HU
Ala Vaikunthapurramuloo 2020 7.5 14839 165 drama IN
The Devil’s Advocate 1997 7.5 361422 144 horror DE
Sivaji: The Boss 2007 7.5 19556 189 drama IN
Happy as Lazzaro 2018 7.5 17716 125 fantasy IT
The Guns of Navarone 1961 7.5 50150 158 war US
Who Am I 2014 7.5 55044 105 thriller DE
White Christmas 1954 7.5 42373 115 romance US
Ip Man 2 2010 7.5 103673 108 drama CN
42 2013 7.5 93314 128 drama US
Menace II Society 1993 7.5 57399 97 drama US
Hum Aapke Hain Koun..! 1994 7.5 20986 206 romance IN
Blow 2001 7.5 255099 124 drama US
Begin Again 2013 7.4 154049 104 comedy US
Uncut Gems 2019 7.4 261956 130 drama US
Sherlock Holmes: A Game of Shadows 2011 7.4 446531 129 crime US
Rurouni Kenshin Part I: Origins 2012 7.4 25793 134 drama JP
Forgotten 2017 7.4 29804 109 thriller KR
Tamasha 2015 7.4 26790 139 drama IN
Mudbound 2017 7.4 47676 120 drama US
Lady Bird 2017 7.4 277165 94 drama US
Molly’s Game 2017 7.4 165817 140 drama CA
Raman Raghav 2.0 2016 7.4 14380 134 thriller IN
Once Upon a Time in Mumbaai 2010 7.4 17494 132 thriller IN
Jaane Tu… Ya Jaane Na 2008 7.4 26738 155 drama IN
Peepli Live 2010 7.4 12265 104 drama IN
Even the Rain 2010 7.4 13446 104 drama ES
Darkest Hour 2017 7.4 193208 125 drama GB
The Bucket List 2007 7.4 242733 97 drama US
American Factory 2019 7.4 21415 110 documentary US
Miss Americana 2020 7.4 19151 85 documentary US
The Hand of God 2021 7.4 30235 130 drama IT
A Nightmare on Elm Street 1984 7.4 230543 91 horror US
83 2021 7.4 23781 163 drama IN
Meenakshi Sundareshwar 2021 7.4 17141 141 comedy IN
Kurup 2021 7.4 11582 155 crime IN
Crazy, Stupid, Love. 2011 7.4 507878 118 romance US
Schumacher 2021 7.4 21558 112 sports DE
Guzaarish 2010 7.4 18466 126 drama IN
What the Health 2017 7.4 28911 97 documentary US
Mirage 2018 7.4 52657 129 thriller ES
Bully 2011 7.4 10266 92 drama US
Phantom Thread 2017 7.4 128600 130 romance US
Looper 2012 7.4 566791 119 thriller US
Felon 2008 7.4 78039 103 crime US
Life in a Metro 2007 7.4 11934 124 drama IN
Kabhi Khushi Kabhie Gham 2001 7.4 48818 210 drama IN
Kaminey 2009 7.4 17136 135 drama IN
Girl, Interrupted 1999 7.3 180532 127 drama US
Talaash 2012 7.3 41752 149 thriller IN
Starship Troopers 1997 7.3 288960 129 scifi US
The Guernsey Literary & Potato Peel Pie Society 2018 7.3 44917 124 romance GB
Okja 2017 7.3 116305 122 drama KR
Ishqiya 2010 7.3 10415 115 comedy IN
The Edge of Seventeen 2016 7.3 120488 104 comedy US
Pyaar Ka Punchnama 2 2015 7.3 14968 159 comedy IN
The Nightingale 2018 7.3 28196 136 thriller AU
Corpse Bride 2005 7.3 265023 77 fantasy US
Memoirs of a Geisha 2005 7.3 146847 145 drama FR
El Camino: A Breaking Bad Movie 2019 7.3 216847 123 thriller US
Official Secrets 2019 7.3 45200 112 thriller GB
Coach Carter 2005 7.3 143670 130 drama US
The King 2019 7.3 119020 140 drama AU
Identity 2003 7.3 240433 90 thriller US
The Best of Enemies 2019 7.3 16441 133 drama US
The Professionals 1966 7.3 16168 117 western US
Raat Akeli Hai 2020 7.3 17570 149 thriller IN
Little Women 1994 7.3 57621 115 drama US
Badhaai Do 2022 7.3 15032 147 comedy IN
The Conjuring 2 2016 7.3 260693 134 thriller US
The Fundamentals of Caring 2016 7.3 70542 97 drama US
The Ballad of Buster Scruggs 2018 7.3 141528 132 western US
Shot Caller 2017 7.3 83961 120 thriller US
The Disaster Artist 2017 7.3 149604 104 comedy US
Blue Jay 2016 7.3 17033 81 romance US
Te3n 2016 7.3 12816 136 thriller IN
Monster 2003 7.3 149218 110 crime US
No One Killed Jessica 2011 7.2 11665 136 crime IN
Toilet: A Love Story 2017 7.2 20675 155 comedy IN
Wish Dragon 2021 7.2 24712 99 fantasy CN
Mom 2017 7.2 10320 147 crime IN
The Professor and the Madman 2019 7.2 44418 124 thriller US
Kabir Singh 2019 7.2 30949 172 drama IN
Dolemite Is My Name 2019 7.2 59836 118 comedy US
In the Line of Fire 1993 7.2 101939 128 drama US
The Witcher: Nightmare of the Wolf 2021 7.2 41508 83 fantasy PL
Ittefaq 2017 7.2 12095 107 thriller IN
The Tinder Swindler 2022 7.2 57606 114 crime GB
First They Killed My Father 2017 7.2 17871 136 drama KH
Don’t Look Up 2021 7.2 498447 138 scifi US
Wazir 2016 7.2 18681 103 thriller IN
Gabbar Is Back 2015 7.2 24676 130 drama IN
Fyre 2019 7.2 44715 98 documentary US
Steve Jobs 2015 7.2 166288 122 drama GB
Shooter 2007 7.2 329417 124 thriller US
The Butler 2013 7.2 114013 132 drama US
The Siege of Jadotville 2016 7.2 38308 108 thriller IE
Closer 2004 7.2 215678 94 drama GB
Private Life 2018 7.2 19023 123 drama US
American Murder: The Family Next Door 2020 7.2 26355 83 crime US
St. Vincent 2014 7.2 102103 102 comedy US
A River Runs Through It 1992 7.2 59086 123 drama US
The Patriot 2000 7.2 270231 165 drama DE
Five Feet Apart 2019 7.2 61878 116 romance US
Michael Clayton 2007 7.2 163878 120 thriller US
The Edge of Democracy 2019 7.2 14605 121 documentary BR
Paddington 2014 7.2 111092 96 comedy FR
Wedding Association 2013 7.1 22186 106 comedy XX
Let Me In 2010 7.1 120208 116 horror GB
Raajneeti 2010 7.1 17555 167 drama IN
Body of Lies 2008 7.1 224896 128 drama GB
The Forgotten Battle 2020 7.1 26368 124 drama LT
Knock Down the House 2019 7.1 12418 86 documentary US
The Call 2020 7.1 29450 112 thriller KR
Forgetting Sarah Marshall 2008 7.1 280121 111 comedy US
The Railway Man 2013 7.1 39743 116 drama GB
The Great Hack 2019 7.1 22838 114 documentary US
Paddleton 2019 7.1 13419 89 drama US
Desperado 1995 7.1 183638 104 thriller US
Gantz:O 2016 7.1 14501 95 animation JP
The Ring 2002 7.1 341888 111 horror JP
Margin Call 2011 7.1 125883 107 thriller US
Copenhagen 2014 7.1 13135 98 drama US
Black Mirror: Bandersnatch 2018 7.1 123377 90 scifi GB
Metallica: Through the Never 2013 7.1 17433 93 musical US
Seven Years in Tibet 1997 7.1 141308 136 drama US
Girl 2018 7.1 14046 105 drama NL
Kung Fu Panda 3 2016 7.1 152791 95 comedy US
To All the Boys I’ve Loved Before 2018 7.1 101175 100 romance US
Arthur Christmas 2011 7.1 58296 97 drama GB
Trailer Park Boys: The Movie 2006 7.1 12831 95 comedy CA
Karthik Calling Karthik 2010 7.1 11944 135 thriller IN
The White Tiger 2021 7.1 58190 125 drama IN
Shootout at Lokhandwala 2007 7.1 10139 145 crime IN
The Devil All the Time 2020 7.1 122321 138 drama US
Tangerine 2015 7.1 31385 87 comedy US
The Danish Girl 2015 7.1 180805 119 drama DE
The Unforgivable 2021 7.1 101975 112 drama DE
Phir Hera Pheri 2006 7.1 22505 155 comedy IN
War Dogs 2016 7.1 208185 114 crime US
The Dig 2021 7.1 71915 112 drama GB
Blade 1998 7.1 267181 120 action US
Luck by Chance 2009 7.1 10206 155 romance IN
Namastey London 2007 7.1 21745 131 comedy IN
Don 2006 7.1 36836 178 thriller IN
Don 2 2011 7.1 52338 139 action DE
Main Hoon Na 2004 7.0 35142 179 drama IN
Gangubai Kathiawadi 2022 7.0 44045 157 drama IN
Croupier 1998 7.0 21382 94 thriller FR
Pieces of a Woman 2020 7.0 47795 127 drama HU
Raw 2016 7.0 72460 98 horror BE
Loving 2016 7.0 34139 123 drama GB
Harold & Kumar Go to White Castle 2004 7.0 193053 88 comedy US
Haseen Dillruba 2021 7.0 25771 135 drama IN
Rose Island 2020 7.0 20019 117 drama IT
Ferry 2021 7.0 10748 106 drama NL
Handsome Devil 2017 7.0 13708 95 drama IE
Torbaaz 2020 7.0 17828 132 drama IN
Ip Man 3 2015 7.0 54128 105 drama HK
The Foreigner 2017 7.0 112487 113 thriller IN
Mirai 2018 7.0 14983 98 fantasy JP
Gaga: Five Foot Two 2017 7.0 12825 100 musical US
The Christmas Chronicles 2018 7.0 71182 104 fantasy US
Sarkar 2018 7.0 18453 164 drama IN
Rambo 2008 7.0 228799 92 thriller US
Dil Dhadakne Do 2015 7.0 17149 170 drama IN
Soul Surfer 2011 7.0 49101 112 drama US
Happy Gilmore 1996 7.0 217534 92 comedy US
Den of Thieves 2018 7.0 107701 140 thriller US
The Zookeeper’s Wife 2017 7.0 42808 122 drama GB
The Platform 2019 7.0 207877 94 horror ES
Public Enemies 2009 7.0 297525 143 crime US
Ip Man 4: The Finale 2019 7.0 30694 107 drama CN
John Q 2002 7.0 131999 116 drama US
Fashion 2008 6.9 12468 167 romance IN
Ma Rainey’s Black Bottom 2020 6.9 50275 94 musical US
Any Given Sunday 1999 6.9 118479 162 drama US
The Long Riders 1980 6.9 11329 99 western US
Get on Up 2014 6.9 24456 139 drama US
Fukrey 2013 6.9 11656 137 romance IN
Chup Chup Ke 2006 6.9 10528 165 comedy IN
Then Came You 2019 6.9 12356 93 comedy US
The Wolf’s Call 2019 6.9 17236 115 thriller FR
Cloudy with a Chance of Meatballs 2009 6.9 226225 90 animation US
Our Souls at Night 2017 6.9 13360 101 drama US
Chandigarh Kare Aashiqui 2021 6.9 12303 117 crime IN
Welcome 2007 6.9 21799 160 romance IN
Everybody Knows 2018 6.9 34009 132 drama IT
The Meyerowitz Stories (New and Selected) 2017 6.9 47971 112 drama US
I Don’t Feel at Home in This World Anymore 2017 6.9 55549 93 drama US
The Highwaymen 2019 6.9 88714 132 thriller US
Legend 2015 6.9 175273 132 thriller US
AK vs AK 2020 6.9 14048 108 drama IN
My Girl 1991 6.9 79800 103 comedy US
Amanda Knox 2016 6.9 23969 92 crime DK
2 States 2014 6.9 25344 149 comedy IN
The Power of the Dog 2021 6.9 158487 126 drama CA
The Half of It 2020 6.9 34959 104 comedy US
Outlaw King 2018 6.9 69834 121 drama GB
Mary Kom 2014 6.9 10656 122 drama IN
Suffragette 2015 6.9 41529 106 drama FR
Legend of the Guardians: The Owls of Ga’Hoole 2010 6.9 82623 100 fantasy US
Christine 2016 6.9 14977 115 drama US
The Night Comes for Us 2018 6.9 25500 121 thriller ID
The Trip 2021 6.9 19706 113 comedy NO
The Dirt 2019 6.9 47603 108 drama US
Top Gun 1986 6.9 329656 110 drama US
Radhe Shyam 2022 6.9 21328 138 romance IN
Sorry to Bother You 2018 6.9 75653 111 fantasy US
shows1[,-1] %>% kable() %>%
  kable_styling("striped", full_width = FALSE) %>%
  scroll_box(height = "420px")
TITLE RELEASE_YEAR SCORE NUMBER_OF_VOTES DURATION NUMBER_OF_SEASONS MAIN_GENRE MAIN_PRODUCTION
Breaking Bad 2008 9.5 1727694 48 5 drama US
Avatar: The Last Airbender 2005 9.3 297336 24 3 scifi US
Our Planet 2019 9.3 41386 50 1 documentary GB
Kota Factory 2019 9.3 66985 42 2 drama IN
The Last Dance 2020 9.1 108321 50 1 documentary US
Arcane 2021 9.1 175412 41 1 action US
Attack on Titan 2013 9.0 325381 24 4 scifi JP
Hunter x Hunter 2011 9.0 87857 23 3 drama JP
DEATH NOTE 2006 9.0 302147 24 1 scifi JP
Seinfeld 1989 8.9 302700 24 9 comedy US
Cowboy Bebop 1998 8.9 112887 25 1 western JP
Heartstopper 2022 8.9 28978 28 1 drama GB
When They See Us 2019 8.9 114127 74 1 drama US
Monty Python’s Flying Circus 1969 8.8 72895 30 4 comedy GB
BoJack Horseman 2014 8.8 143584 26 6 drama US
Chappelle’s Show 2003 8.8 62140 21 3 comedy US
Better Call Saul 2015 8.8 404920 49 6 comedy US
Narcos 2015 8.8 404486 52 3 drama US
One Piece 1999 8.8 112586 23 21 action JP
Peaky Blinders 2013 8.8 485506 58 6 drama GB
Anne with an E 2017 8.7 51001 46 3 drama CA
Dark 2017 8.7 354443 56 3 scifi DE
House of Cards 2013 8.7 494092 52 6 drama US
Demon Slayer: Kimetsu no Yaiba 2019 8.7 88265 25 3 animation JP
Stranger Things 2016 8.7 989090 52 5 scifi US
One-Punch Man 2015 8.7 148386 24 2 action JP
The Crown 2016 8.7 190878 56 5 drama US
Arrested Development 2003 8.7 297552 28 5 comedy US
Friday Night Lights 2006 8.7 64449 43 5 drama US
Downton Abbey 2010 8.7 197744 58 6 drama GB
Code Geass: Lelouch of the Rebellion 2006 8.7 62367 24 3 scifi JP
Trailer Park Boys 2001 8.6 41791 25 12 comedy CA
Mindhunter 2017 8.6 261429 53 2 crime US
The Haunting of Hill House 2018 8.6 226817 58 1 drama US
The Queen’s Gambit 2020 8.6 406350 56 1 drama US
Cobra Kai 2018 8.6 163858 31 5 action US
Wentworth 2013 8.6 21747 46 9 drama AU
It’s Okay to Not Be Okay 2020 8.6 21104 76 1 drama KR
Making a Murderer 2015 8.6 93456 62 2 crime US
Shameless 2011 8.6 230243 54 11 drama US
Sacred Games 2018 8.6 85088 50 2 action IN
Formula 1: Drive to Survive 2019 8.6 36661 38 6 documentary GB
Queer Eye 2018 8.5 18147 47 6 reality US
Community 2009 8.5 252564 23 6 comedy US
The Last Kingdom 2015 8.5 126473 55 5 action GB
Schitt’s Creek 2015 8.5 112537 22 6 comedy CA
Neon Genesis Evangelion 1995 8.5 64727 24 1 scifi JP
Call the Midwife 2012 8.5 25562 56 11 drama GB
The IT Crowd 2006 8.5 147409 25 5 comedy GB
ERASED 2016 8.5 42699 22 1 drama JP
Supernatural 2005 8.5 428639 45 15 scifi US
Ozark 2017 8.5 278223 60 4 crime US
Delhi Crime 2019 8.5 18732 27 1 drama IN
After Life 2019 8.5 124972 28 3 comedy GB
Hilda 2018 8.5 10162 25 2 scifi GB
Borgen 2010 8.5 23523 58 4 drama DK
Ash vs Evil Dead 2015 8.4 70087 30 3 action US
Heartland 2007 8.4 15743 44 15 drama CA
Unbelievable 2019 8.4 95658 48 1 drama US
The Promised Neverland 2019 8.4 34730 23 2 scifi JP
Derry Girls 2018 8.4 28718 25 3 comedy GB
Innocent 2017 8.4 17727 51 1 drama TR
Trollhunters: Tales of Arcadia 2016 8.4 16509 22 3 action US
Outlander 2014 8.4 152435 60 6 scifi US
The Legend of Korra 2012 8.4 117464 23 4 action US
The Dark Crystal: Age of Resistance 2019 8.4 24164 51 1 scifi GB
Vincenzo 2021 8.4 15134 81 1 action KR
Top Boy 2011 8.4 22420 48 2 drama GB
Babylon Berlin 2017 8.4 23256 49 4 crime DE
Stargate SG-1 1997 8.4 90196 44 10 scifi US
Violet Evergarden 2018 8.4 19940 25 1 drama JP
Narcos: Mexico 2018 8.4 80902 56 3 drama US
The Dragon Prince 2018 8.4 21635 27 3 scifi US
Maid 2021 8.4 74955 54 1 drama US
Car Masters: Rust to Riches 2018 8.4 10024 39 3 reality US
Naruto 2002 8.4 93980 23 6 scifi JP
The Originals 2013 8.3 131574 43 5 scifi US
Fauda 2015 8.3 25239 40 3 war IL
Sex Education 2019 8.3 251168 52 4 drama GB
Kingdom 2019 8.3 43760 48 2 action KR
Money Heist 2017 8.3 450797 50 5 crime ES
Young Royals 2021 8.3 26732 45 1 drama SE
Atypical 2017 8.3 81643 31 4 comedy US
Gurren Lagann 2007 8.3 17024 24 1 scifi JP
30 Rock 2006 8.3 121514 25 7 comedy US
Castlevania 2017 8.3 61114 26 4 scifi US
Kim’s Convenience 2016 8.3 16970 22 5 comedy CA
Master of None 2015 8.3 72341 32 3 drama US
Longmire 2012 8.3 34362 53 6 western US
Call My Agent! 2015 8.3 13331 55 4 comedy FR
The Get Down 2016 8.2 22304 62 2 drama US
Sense8 2015 8.2 151518 62 2 scifi US
One Day at a Time 2017 8.2 15669 29 4 comedy US
The Kominsky Method 2018 8.2 38232 26 3 drama US
Itaewon Class 2020 8.2 12030 69 1 drama KR
The Good Place 2016 8.2 148562 23 4 scifi US
Grace and Frankie 2015 8.2 48435 30 7 comedy US
The Witcher 2019 8.2 465949 58 2 scifi US
Locked Up 2015 8.2 21388 50 4 drama ES
Gilmore Girls 2000 8.2 119054 46 8 comedy US
Caliphate 2020 8.2 17735 48 1 war SE
Fate/Zero 2011 8.2 12568 25 2 scifi JP
The Walking Dead 2010 8.2 945125 46 11 action US
anohana: The Flower We Saw That Day 2011 8.2 12682 23 1 drama JP
The Trials of Gabriel Fernandez 2020 8.1 10982 55 1 crime US
How to Get Away with Murder 2014 8.1 146712 43 6 drama US
Derek 2013 8.1 31976 26 3 drama GB
Rascal Does Not Dream of Bunny Girl Senpai 2018 8.1 10400 25 1 animation JP
Wild Wild Country 2018 8.1 29298 67 1 crime US
Bodyguard 2018 8.1 114446 60 2 war GB
American Vandal 2017 8.1 29972 33 2 comedy US
The End of the F***ing World 2017 8.1 177868 21 2 crime GB
Manhunt 2017 8.1 57459 43 2 documentary US
Criminal Minds 2005 8.1 189191 44 15 thriller US
HAPPY! 2017 8.1 37747 43 2 scifi US
Orange Is the New Black 2013 8.1 295591 59 7 drama US
Star Trek: Deep Space Nine 1993 8.1 61145 47 7 scifi US
Lovesick 2014 8.1 19259 24 3 comedy GB
Lucifer 2016 8.1 308291 47 6 scifi US
F is for Family 2015 8.0 36050 28 5 comedy US
Don’t F**k with Cats: Hunting an Internet Killer 2019 8.0 50250 62 1 crime US
GLOW 2017 8.0 44751 34 4 drama US
The Umbrella Academy 2019 8.0 202522 52 3 comedy US
Tabula Rasa 2017 8.0 10161 52 1 drama BE
Dead to Me 2019 8.0 73110 31 2 drama US
The Mechanism 2018 8.0 36077 47 2 drama BR
Squid Game 2021 8.0 416738 54 2 action KR
Toradora! 2008 8.0 14307 30 2 animation JP
Comedians in Cars Getting Coffee 2012 8.0 12363 20 11 comedy US
Marco Polo 2014 8.0 71229 55 2 action US
Travelers 2016 8.0 54566 45 3 scifi CA
The Blacklist 2013 8.0 238138 43 9 drama US
Unorthodox 2020 8.0 74118 54 1 drama DE
Inside Bill’s Brain: Decoding Bill Gates 2019 7.9 10776 50 1 documentary US
On My Block 2018 7.9 15642 29 4 comedy US
Medici: Masters of Florence 2016 7.9 18575 54 3 war GB
The Seven Deadly Sins 2014 7.9 30341 24 5 action JP
Queen of the South 2016 7.9 28537 42 5 drama US
KILL la KILL 2013 7.9 13372 25 1 action JP
Bloodline 2015 7.9 50028 57 3 drama US
Into the Badlands 2015 7.9 45628 43 3 action US
Merlin 2008 7.9 80138 44 5 action GB
InuYasha 2000 7.9 15823 25 9 action JP
Altered Carbon 2018 7.9 162018 52 2 scifi US
The OA 2016 7.9 100911 55 2 scifi US
Resurrection: Ertugrul 2014 7.9 35515 57 5 war TR
The Borgias 2011 7.9 50556 51 3 drama CA
Lilyhammer 2012 7.9 29196 45 3 crime NO
Boys Over Flowers 2009 7.9 11579 64 1 comedy KR
I Think You Should Leave with Tim Robinson 2019 7.9 11411 16 3 comedy US
Suburra: Blood on Rome 2017 7.9 14346 48 3 crime IT
The Sinner 2017 7.9 117055 45 4 crime US
The Spy 2019 7.9 40535 53 1 drama FR
Santa Clarita Diet 2017 7.9 64467 29 3 comedy US
How to Sell Drugs Online (Fast) 2019 7.9 30734 32 3 drama DE
Rise of Empires: Ottoman 2020 7.9 22045 45 2 documentary TR
Big Mouth 2017 7.9 74660 27 6 animation US
Versailles 2015 7.9 16273 54 3 drama CA
She-Ra and the Princesses of Power 2018 7.8 14935 24 5 scifi US
A Series of Unfortunate Events 2017 7.8 59239 47 3 action US
Gotham 2014 7.8 226081 43 5 scifi US
Workin’ Moms 2017 7.8 14892 22 6 comedy CA
Russian Doll 2019 7.8 88945 28 2 drama US
Sweet Tooth 2021 7.8 49182 45 2 scifi US
Imposters 2017 7.8 13238 43 2 drama US
iZombie 2015 7.8 66488 42 5 scifi US
Conversations with a Killer: The Ted Bundy Tapes 2019 7.8 27439 59 1 crime US
Never Have I Ever 2020 7.8 45346 28 4 drama US
My Name 2021 7.8 18446 50 1 thriller KR
Happy Endings 2011 7.8 37778 21 3 comedy US
Crazy Ex-Girlfriend 2015 7.8 19738 42 4 comedy US
El Chapo 2017 7.8 17992 44 3 drama US
NCIS 2003 7.8 141049 44 19 action US
Jane the Virgin 2014 7.8 46286 42 5 drama US
Undercover 2019 7.8 16773 49 3 drama BE
DOTA: Dragon’s Blood 2021 7.8 17429 26 2 scifi US
The Innocent 2021 7.8 29019 58 1 crime ES
The Staircase 2004 7.8 21531 49 2 crime FR
Giri/Haji 2019 7.8 13570 58 1 thriller GB
Angel Beats! 2010 7.7 13848 26 1 scifi JP
Love 2016 7.7 41362 32 3 drama US
In the Dark 2019 7.7 10927 41 4 comedy US
Midnight Mass 2021 7.7 102321 64 1 action US
The Vampire Diaries 2009 7.7 310776 42 8 scifi US
Alias Grace 2017 7.7 31577 44 1 drama CA
You 2018 7.7 225949 47 3 thriller US
Norsemen 2016 7.7 18040 30 3 comedy NO
Good Girls 2018 7.7 49867 42 4 comedy US
Seven Seconds 2018 7.7 15323 62 1 crime US
The Chestnut Man 2021 7.7 41253 55 1 crime DK
Crashing 2016 7.7 18546 30 1 comedy GB
Maniac 2018 7.7 74877 39 1 drama US
New Girl 2011 7.7 216209 21 7 comedy US
Miraculous: Tales of Ladybug & Cat Noir 2015 7.7 10102 22 4 romance FR
Home for Christmas 2019 7.7 13670 29 2 comedy NO
Inside Job 2021 7.6 15137 28 1 comedy US
The Serpent 2021 7.6 41782 57 1 drama GB
Alice in Borderland 2020 7.6 47651 47 2 action JP
Spinning Out 2020 7.6 13692 50 1 drama US
Messiah 2020 7.6 42018 45 1 drama US
Teenage Bounty Hunters 2020 7.6 10821 49 1 action US
Shadow and Bone 2021 7.6 77782 52 1 scifi US
The Devil Next Door 2019 7.6 13084 46 1 documentary US
My Little Pony: Friendship Is Magic 2010 7.6 20708 22 9 scifi CA
The Magicians 2015 7.6 49557 44 5 drama US
Bordertown 2016 7.6 10371 59 3 drama FI
Unbreakable Kimmy Schmidt 2015 7.6 70242 30 4 comedy US
Chicago Med 2015 7.6 24142 41 8 drama US
The 100 2014 7.6 242221 42 7 drama US
The Flash 2014 7.6 336888 42 8 scifi US
Cable Girls 2017 7.6 13855 50 5 drama ES
Criminal: UK 2019 7.6 17992 44 2 drama GB
Devilman Crybaby 2018 7.6 18575 25 1 scifi JP
Madam Secretary 2014 7.6 22563 43 6 war US
Sword Art Online 2012 7.6 43727 23 4 scifi JP
Gilmore Girls: A Year in the Life 2016 7.6 36359 92 1 comedy US
Dead Set 2008 7.6 19684 141 1 scifi GB
Grey’s Anatomy 2005 7.6 293618 49 18 drama US
Goosebumps 1995 7.6 13361 22 4 scifi CA
Love 101 2020 7.5 13797 46 2 comedy TR
Shooter 2016 7.5 35547 41 3 war US
Reign 2013 7.5 47751 42 4 drama US
Blue Exorcist 2011 7.5 12741 24 2 scifi JP
Outer Banks 2020 7.5 43404 49 2 action US
Arrow 2012 7.5 425716 42 8 action US
Halston 2021 7.5 14040 47 1 drama US
The Night Shift 2014 7.5 12069 42 4 drama US
The Heirs 2013 7.5 10329 59 1 drama KR
The Politician 2019 7.5 21887 43 2 comedy US
Ragnarok 2020 7.5 36185 47 2 action NO
Everything Sucks! 2018 7.5 18023 24 1 drama US
Frequency 2016 7.5 13625 18 3 scifi US
Dark Matter 2015 7.5 41867 43 3 scifi CA
Night Stalker: The Hunt for a Serial Killer 2021 7.5 23939 47 1 crime US
Dash & Lily 2020 7.5 16978 25 1 comedy US
True Story 2021 7.5 16927 39 1 drama US
Hollywood 2020 7.5 35067 50 1 drama US
Designated Survivor 2016 7.5 88019 44 3 war US
Quicksand 2019 7.5 21077 46 1 drama SE
Feel Good 2020 7.5 10317 25 2 drama GB
Dogs of Berlin 2018 7.5 12453 60 1 drama DE
Evil Genius 2018 7.5 27516 48 1 crime US
13 Reasons Why 2017 7.5 282373 58 4 drama US
Lupin 2021 7.5 100575 46 3 crime FR
All of Us Are Dead 2022 7.5 41393 61 1 action KR
I Am Not Okay with This 2020 7.5 56459 21 1 comedy US
nrow(movies1); nrow(shows1)
## [1] 387
## [1] 246

Variables for Movies:

  • Title

  • Release Year

  • IMDb Score

  • Number of Votes

  • Duration (in minutes)

  • Main Genre

  • Main Production (Country Code)

Variables for Shows:

  • Title

  • Release Year

  • IMDb Score

  • Number of Votes

  • Duration (in minutes)

  • Number of Seasons

  • Main Genre

  • Main Production (Country Code)

The two datasets contain the same variables, except that shows have one additional variable, number of seasons. There are 387 movie observations and 246 show observations.

Tidying the Data

We now have a general idea of what we are working with. Next, we look at each variable to check whether any transformations are needed and whether any data is missing.

vis_miss(movies1); vis_miss(shows1)

The missing-values maps show no missing data in either dataset, so we do not have to remove any observations here.
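The visual check can be backed up with a numeric one. The following is a minimal sketch in base R, using a small hypothetical data frame (`toy`, with one value deliberately set to `NA`) rather than the real data. Replacing `toy` with `movies1` or `shows1` should return zero for every column if the maps above are read correctly.

```r
# Hypothetical toy data; the NA in SCORE is planted on purpose for illustration.
toy <- data.frame(
  TITLE = c("Inception", "Forrest Gump", "Anbe Sivam"),
  SCORE = c(8.8, NA, 8.7),
  MAIN_GENRE = c("scifi", "drama", "comedy")
)

# Count missing values per column and across the whole data frame.
na_per_column <- colSums(is.na(toy))
total_na <- sum(is.na(toy))

print(na_per_column)
print(total_na)
```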

m1 <- movies1 %>%
  group_by(MAIN_GENRE) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) + 
  geom_bar(stat = "identity", fill = "#ff9896") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Genre")
s1 <- shows1 %>%
  group_by(MAIN_GENRE) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) + 
  geom_bar(stat = "identity", fill = "#c49c94") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Genre")
m1 + s1

m2 <- movies1 %>%
  group_by(MAIN_PRODUCTION) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_PRODUCTION, count))) + 
  geom_bar(stat = "identity", fill = "#ff9896") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Production")
s2 <- shows1 %>%
  group_by(MAIN_PRODUCTION) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_PRODUCTION, count))) + 
  geom_bar(stat = "identity", fill = "#c49c94") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Production")
m2 + s2

As we can see, there are 35 different countries for movies and 19 for shows. Since this is a categorical variable, it would create many dummy variables in our recipe. In addition, many countries have only one observation, so the raw country code would not make a good predictor for our models. We will group the main production country into regions (Asia/Oceania, North/South America, Europe, and Africa/Middle East) to make the data easier to work with. Similarly, main genre has many categories, with 15 for movies and 12 for shows. Grouping genres into broader categories is not as straightforward, since each genre is already its own category. Instead, we will drop any genre with fewer than five observations.

m3 <- movies1 %>%
  ggplot(aes(x=RELEASE_YEAR)) + 
  geom_histogram(aes(y=after_stat(density)), fill = "black") + 
  geom_density(alpha=0.7, fill="#ff9896") + 
  theme_hc() + 
  labs(x = "Release Year", y = "Density")
s3 <- shows1 %>%
  ggplot(aes(x=RELEASE_YEAR)) + 
  geom_histogram(aes(y=after_stat(density)), fill = "black") + 
  geom_density(alpha=0.7, fill="#c49c94") + 
  theme_hc() + 
  labs(x = "Release Year", y = "Density")
m3 + s3

Looking at the histograms for release year, we see that the data is heavily skewed left. There are a few observations released before 2000, but because there are so few, we will only look at movies and shows released in or after 2000. We will also change release year from a numeric variable to an ordered categorical variable.

movies2 <- subset(movies1, MAIN_PRODUCTION!="XX" & RELEASE_YEAR >= 2000) # XX is not a country
movies_genre_counts <- table(movies2$MAIN_GENRE)
movies_selected_genres <- movies_genre_counts[movies_genre_counts >= 5]
movies2 <- subset(movies2, MAIN_GENRE %in% names(movies_selected_genres)) # only keeping main genre levels with at least 5 obs
movies <- movies2 %>%
  mutate(REGION = forcats::fct_collapse(MAIN_PRODUCTION,
                                        AsiaOceania = c("CN", "HK", "ID", "IN", "JP", 
                                                 "KH", "KR", "TH", "AU", "NZ"),
                                        AfricaME = c("CD", "MW", "ZA", "PS", "TR"),
                                        NSAmerica = c("CA", "US", "AR", "BR", "MX"),
                                        Europe = c("BE", "DE", "DK", "ES", "FR", 
                                                   "GB", "HU", "IE", "IT", "LT",
                                                   "NL", "NO", "PL", "UA"))) %>%
  select(-MAIN_PRODUCTION)
movies$RELEASE_YEAR <- factor(movies$RELEASE_YEAR, ordered=TRUE)

shows2 <- subset(shows1, RELEASE_YEAR >= 2000)
shows_genre_counts <- table(shows2$MAIN_GENRE)
shows_selected_genres <- shows_genre_counts[shows_genre_counts >= 5]
shows2 <- subset(shows2, MAIN_GENRE %in% names(shows_selected_genres))
shows <- shows2 %>%
  mutate(REGION = forcats::fct_collapse(MAIN_PRODUCTION,
                                        AsiaOceania = c("IN", "JP", "KR", "AU"),
                                        AfricaME = c("TR", "IL"),
                                        NSAmerica = c("CA", "US", "BR"),
                                        Europe = c("BE", "DE", "DK", "ES", "FI",
                                                   "FR", "GB", "IT", "NO", "SE"))) %>%
  select(-MAIN_PRODUCTION)
shows$RELEASE_YEAR <- factor(shows$RELEASE_YEAR, ordered=TRUE)
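Because `fct_collapse()` leaves any level that is not listed unchanged, a country code missing from the lists above would survive as its own "region". A quick sanity check can be sketched in base R. The lookup vector and the `codes` vector here are hypothetical and cover only a subset of the codes used above.

```r
# Hypothetical lookup from country code to region (subset of the codes above)
region_lookup <- c(US = "NSAmerica", CA = "NSAmerica", GB = "Europe",
                   FR = "Europe", IN = "AsiaOceania", KR = "AsiaOceania",
                   TR = "AfricaME")

# Toy production codes; "ZZ" is deliberately absent from the lookup
codes <- c("US", "GB", "IN", "ZZ", "KR", "US")

# Codes that would be left uncollapsed
unmapped <- setdiff(unique(codes), names(region_lookup))
print(unmapped)

# Region counts for the codes that do map
mapped <- region_lookup[codes[codes %in% names(region_lookup)]]
print(table(mapped))
```

An empty `unmapped` result on the real data would confirm that every country was assigned to one of the four regions.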

Let’s take a quick look at our new data:

m4 <- movies %>%
  group_by(MAIN_GENRE) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) + 
  geom_bar(stat = "identity", fill = "#ff9896") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Genre")
s4 <- shows %>%
  group_by(MAIN_GENRE) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) + 
  geom_bar(stat = "identity", fill = "#c49c94") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Main Genre")
m4 + s4

m5 <- movies %>%
  group_by(REGION) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(REGION, count))) + 
  geom_bar(stat = "identity", fill = "#ff9896") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Region")
s5 <- shows %>%
  group_by(REGION) %>%
  summarize(count = n()) %>%
  ggplot(aes(x=count, y=reorder(REGION, count))) + 
  geom_bar(stat = "identity", fill = "#c49c94") + 
  geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) + 
  theme_hc() + 
  labs(x = "Count", y = "Region")
m5 + s5

m6 <- movies %>%
  ggplot(aes(x=RELEASE_YEAR)) + 
  geom_bar(fill = "#ff9896") + 
  theme_hc() + 
  labs(x = "Release Year", y = "Count") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
s6 <- shows %>%
  ggplot(aes(x=RELEASE_YEAR)) + 
  geom_bar(fill = "#c49c94") + 
  theme_hc() + 
  labs(x = "Release Year", y = "Count") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
m6 + s6

Visual EDA

Variable Distributions

Before getting into our model building, we want to look at the distribution of movies and shows. Let’s look at the distribution of scores, main genres, and regions.

movies$TYPE <- rep("movie", nrow(movies))
shows$TYPE <- rep("show", nrow(shows))
netflix_combined <- dplyr::bind_rows(movies[c(3:9)], shows[c(3:6,8:10)])

netflix_combined %>%
  ggplot(aes(x=TYPE, y=SCORE, fill=TYPE)) + 
  geom_boxplot() +
  theme_hc() + 
  scale_fill_manual(values = c("#ff9896", "#c49c94")) + 
  labs(x = "Type", y = "Score", title = "Box Plot of Score Distribution", fill = "Type")

netflix_combined %>%
  ggplot(aes(x=TYPE, fill=MAIN_GENRE)) +
  geom_bar() + 
  theme_hc() + 
  labs(x = "Type", y = "Count", title = "Stacked Bar Chart of Main Genres", fill = "Main Genre")

netflix_combined %>%
  ggplot(aes(x=TYPE, fill=REGION)) +
  geom_bar() + 
  theme_hc() + 
  labs(x = "Type", y = "Count", title = "Stacked Bar Chart of Regions", fill = "Region")

netflix_combined %>%
  ggplot(aes(x=NUMBER_OF_VOTES, fill=TYPE)) + 
  geom_density(alpha = 0.7) + 
  scale_fill_manual(values = c("#ff9896", "#c49c94")) + 
  theme_hc() + 
  labs(x = "Number of Votes", y = "Density", title = "Density Plot of Number of Votes", fill = "Type")

Here are some things we observe:

  • SCORE: Movie scores range from 6.9 to 9.0 and show scores range from 7.5 to 9.5. Show scores have a higher median than movie scores, but both span about two points.

  • MAIN GENRE: For both movies and shows, drama is the genre with the most observations, making up about 1/3 of each dataset. It is followed by thriller and then comedy for movies, and by comedy and then scifi for shows. The genre distribution is heavily imbalanced, which is something we’ll have to keep in mind when we’re building our models.

  • REGION: For movies, there is a decent proportion of observations from North/South America, Asia/Oceania, and Europe, with only a few from Africa/Middle East. For shows, over half of the observations are from North/South America.

  • NUMBER OF VOTES: The distribution of the number of votes is heavily skewed right. This aligns with our understanding that most movies and shows have relatively few votes, and only a few really popular ones have a very high number of votes.

Now that we have explored individual variables, we want to see whether there are any relationships between variables.

Variable Correlation Plot

movies %>%
  dplyr::select(SCORE, NUMBER_OF_VOTES, DURATION) %>%
  cor() %>%
  corrplot(type="lower", method="color", diag=FALSE, addCoef.col = "black", number.cex = 1)

shows %>%
  dplyr::select(SCORE, NUMBER_OF_VOTES, DURATION, NUMBER_OF_SEASONS) %>%
  cor() %>%
  corrplot(type="lower", method="color", diag=FALSE, addCoef.col = "black", number.cex = 1)

While our dataset does not contain many numeric variables, it is still interesting to look at the correlation plots of what we have. The strongest correlation is between the number of votes and score. This may be because the more popular a movie or show is, the more high scores it receives. The number of seasons of a show and the number of votes also have a moderately strong correlation. This isn’t surprising either, as a show with more seasons is often popular and long-running and so accumulates more votes.
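For readers who want to see what the correlation plot is drawn from, here is a minimal base R sketch of the matrix that `cor()` returns. The `toy` values are made up for illustration and are not taken from the Netflix data.

```r
# Hypothetical toy data standing in for the numeric columns of movies
toy <- data.frame(
  SCORE = c(7.0, 7.5, 8.0, 8.5, 9.0),
  NUMBER_OF_VOTES = c(10000, 30000, 20000, 90000, 150000),
  DURATION = c(100, 90, 120, 110, 95)
)

# Pearson correlation matrix: symmetric, with ones on the diagonal
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Locate the strongest off-diagonal pair
off_diag <- cor_mat
diag(off_diag) <- NA
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)[1, ]
print(rownames(cor_mat)[idx])
```

`corrplot()` with `type="lower"` and `diag=FALSE` then displays only the entries below the diagonal of such a matrix.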

Genre and Score

Since main genre is our variable of interest, we want to see how each genre relates to our predictor variables. In this plot we see the distribution of scores for each genre of movies and shows. For movies, there is a wide range of score values for each genre. Scifi, documentary, and comedy have the highest medians as well as the widest ranges, while horror has the lowest. For shows, score ranges differ less between genres than they do for movies. Drama, documentary, scifi, and action have the largest ranges while war has the smallest.

Score and Number of Votes

m8 <- movies %>%
  ggplot(aes(x=SCORE, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) + 
  geom_point() + 
  theme_hc() + 
  labs(x = "Score", y = "Number of Votes", title="Movies", color = "Main Genre") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.key.size = unit(1, 'cm'),
        legend.key.height = unit(0.5, 'cm'),
        legend.key.width = unit(0.5, 'cm'),
        legend.title = element_text(size=6),
        legend.text = element_text(size=4))
s8 <- shows %>%
  ggplot(aes(x=SCORE, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) + 
  geom_point() + 
  theme_hc() + 
  labs(x = "Score", y = "Number of Votes", title="Shows", color = "Main Genre") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.key.size = unit(0.5, 'cm'),
        legend.key.height = unit(0.5, 'cm'),
        legend.key.width = unit(0.5, 'cm'),
        legend.title = element_text(size=6),
        legend.text = element_text(size=4))
m8 + s8

From the plots we see that most movies have fewer than 800,000 votes and most shows have fewer than 500,000, with a few outliers. The observations with a high number of votes all have a high score as well. Genres appear evenly scattered across the plots, so neither score nor number of votes seems strongly associated with genre.

Genre and Duration

m9 <- movies %>%
  ggplot(aes(y=MAIN_GENRE, x=DURATION)) + 
  geom_boxplot(fill="#ff9896") + 
  theme_hc() + 
  labs(x = "Duration", y = "Main Genre", title = "Movies")
s9 <- shows %>%
  ggplot(aes(y=MAIN_GENRE, x=DURATION)) + 
  geom_boxplot(fill="#c49c94") + 
  theme_hc() + 
  labs(x = "Duration", y = "Main Genre", title = "Shows")
m9 + s9

There is considerable variation in duration between genres for both movies and shows. For movies, scifi has the highest median duration, with western, romance, drama, and crime close behind. Drama contains many outliers with longer durations. For shows, crime has the highest median and comedy has the lowest. Because of these clear distinctions, duration might be a good predictor of main genre.

Number of Seasons

The number of seasons is only present in our shows dataset. Let’s look at how it relates to the other variables.

shows %>%
  ggplot(aes(x=NUMBER_OF_SEASONS, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) + 
  geom_point() + 
  theme_hc() + 
  labs(x = "Number of Seasons", y = "Number of Votes", title = "Scatterplot of Number of Votes against Number of Seasons", color = "Main Genre")

shows %>%
  ggplot(aes(x=MAIN_GENRE, y=NUMBER_OF_SEASONS)) + 
  geom_boxplot() + 
  theme_hc() +
  labs(x = "Main Genre", y = "Number of Seasons", title = "Box Plot of Number of Seasons per Genre")

Here we are looking at relationships with the number of seasons for shows only. There does not appear to be an obvious relationship between the number of seasons and the number of votes, but on closer inspection we see that all the shows with a high number of votes have at least 5 seasons. In the box plot of the number of seasons for each genre, comedy has the highest median, but action and drama have multiple outliers with the highest numbers of seasons. Documentary is the genre with the fewest seasons, which makes sense.

Setting up Models

Train/Test Split

Before building our models, we need to split the data into training and testing sets. We will use an 80/20 split, stratified on the outcome variable, main genre, for both datasets. We will build our models on the training set. We will also use k-fold cross-validation with five folds to estimate each model’s error rate on new data.

set.seed(131) # setting a seed to replicate results
movies_split <- initial_split(movies, prop=0.8, strata=MAIN_GENRE)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)

shows_split <- initial_split(shows, prop=0.8, strata=MAIN_GENRE)
shows_train <- training(shows_split)
shows_test <- testing(shows_split)

nrow(movies_train); nrow(movies_test)
## [1] 261
## [1] 66
nrow(shows_train); nrow(shows_test)
## [1] 180
## [1] 46

For movies, there are 261 observations in the training dataset and 66 in the testing dataset. For shows, there are 180 in training and 46 in testing.
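To illustrate what stratifying on the outcome does, here is a minimal base R sketch that samples 80% of the rows within each class separately. It is not how `initial_split()` is implemented internally; it only shows the idea. The `genre` vector is hypothetical.

```r
set.seed(131)

# Hypothetical imbalanced outcome, standing in for MAIN_GENRE
genre <- factor(rep(c("drama", "comedy", "thriller"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each genre separately
train_idx <- unlist(lapply(split(seq_along(genre), genre), function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train <- genre[train_idx]
test <- genre[-train_idx]

# Class proportions are preserved in both pieces
print(prop.table(table(train)))
print(prop.table(table(test)))
```

Without stratification, a rare genre could by chance end up almost entirely in one of the two sets.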

Building Our Recipe

Next, we are going to build a recipe for all our models. A recipe is a general guide to which predictors to use and how to preprocess them. Every model we build uses the same recipe, but each works with it in its own way. The variables we are using to predict the main genre are release year, score, number of votes, duration, and region for movies, and the same plus number of seasons for shows.

movies_recipe <- movies_train %>%
  recipe(MAIN_GENRE ~ RELEASE_YEAR + NUMBER_OF_VOTES + DURATION + SCORE + REGION) %>%
  step_naomit() %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ NUMBER_OF_VOTES:SCORE) %>%
  step_normalize(NUMBER_OF_VOTES, DURATION, SCORE)

shows_recipe <- shows_train %>%
  recipe(MAIN_GENRE ~ RELEASE_YEAR + NUMBER_OF_VOTES + DURATION + NUMBER_OF_SEASONS + SCORE + REGION) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ NUMBER_OF_SEASONS:NUMBER_OF_VOTES + NUMBER_OF_VOTES:SCORE) %>%
  step_normalize(NUMBER_OF_VOTES, NUMBER_OF_SEASONS, DURATION, SCORE)
movies_recipe %>%
  prep() %>%
  bake(new_data = movies_train) %>%
  head() %>%
  kable() %>%
  kable_styling("striped", full_width = TRUE) %>%
  scroll_box(width = "1000px", height = "250px")
NUMBER_OF_VOTES DURATION SCORE MAIN_GENRE RELEASE_YEAR_01 RELEASE_YEAR_02 RELEASE_YEAR_03 RELEASE_YEAR_04 RELEASE_YEAR_05 RELEASE_YEAR_06 RELEASE_YEAR_07 RELEASE_YEAR_08 RELEASE_YEAR_09 RELEASE_YEAR_10 RELEASE_YEAR_11 RELEASE_YEAR_12 RELEASE_YEAR_13 RELEASE_YEAR_14 RELEASE_YEAR_15 RELEASE_YEAR_16 RELEASE_YEAR_17 RELEASE_YEAR_18 RELEASE_YEAR_19 RELEASE_YEAR_20 RELEASE_YEAR_21 RELEASE_YEAR_22 REGION_AsiaOceania REGION_Europe REGION_AfricaME NUMBER_OF_VOTES_x_SCORE
9.9347858 0.8562228 2.972114 scifi -0.0314347 -0.2284779 0.0716822 0.2189046 -0.1113334 -0.2016875 0.1496644 0.1773300 -0.1860008 -0.1453917 0.2195311 0.1049819 -0.2491707 -0.0545830 0.2733215 -0.0085337 -0.2893212 0.0900310 0.2918350 -0.2054929 0.3693311 0.3646229 0 1 0 19960934.4
-0.4651770 1.2868088 2.744095 comedy -0.2514778 0.1062688 0.1102803 -0.2622438 0.2603488 -0.1083494 -0.1062453 0.2675052 -0.2955184 0.1825521 0.0141874 -0.2110746 0.3410505 -0.3781936 0.3367461 -0.2527514 0.1626832 -0.0899621 0.0423123 -0.0165073 -0.0051674 0.0002459 1 0 0 179176.5
-0.3565408 -1.3325895 2.744095 comedy 0.3143473 0.2975526 0.1929906 0.0367141 -0.1301744 -0.2708734 -0.3612340 -0.3933900 -0.3741137 -0.3194662 -0.2482792 -0.1767750 -0.1156103 -0.0694340 -0.0381949 -0.0191462 -0.0086753 -0.0035097 -0.0012441 -0.0003746 0.0000837 -0.0000374 0 0 0 383443.8
-0.4940445 -2.3014081 2.060037 comedy 0.1571737 -0.1009553 -0.2481307 -0.1151112 0.1490154 0.2503273 0.0767729 -0.1834196 -0.2512957 -0.0548575 0.2034769 0.2637044 0.0673126 -0.2004877 -0.3003033 -0.1517471 0.1244823 0.3380159 0.3838752 0.2900003 -0.1354521 0.0883456 0 0 0 120590.4
1.2245250 1.6456305 2.060037 comedy -0.0628695 -0.2125376 0.1378504 0.1670079 -0.1986872 -0.0977828 0.2371605 0.0115947 -0.2460934 0.0830810 0.2195311 -0.1749698 -0.1533358 0.2491226 0.0462544 -0.2852890 0.0979241 0.2540047 -0.2648964 -0.1033544 -0.4819533 -0.2290638 1 0 0 3240568.8
-0.4785257 -0.9378857 1.832018 documentary 0.1257389 -0.1487763 -0.2315887 -0.0115939 0.2260924 0.1740970 -0.1025284 -0.2529145 -0.0863916 0.1918775 0.2404388 -0.0016664 -0.2500834 -0.2173991 0.0623734 0.2897966 0.2326439 -0.0536990 -0.3232634 -0.3939626 0.2417697 -0.1843048 0 1 0 146993.0
shows_recipe %>%
  prep() %>%
  bake(new_data = shows_train) %>%
  head() %>%
  kable() %>%
  kable_styling("striped", full_width = TRUE) %>%
  scroll_box(width = "1000px", height = "250px")
NUMBER_OF_VOTES DURATION NUMBER_OF_SEASONS SCORE MAIN_GENRE RELEASE_YEAR_01 RELEASE_YEAR_02 RELEASE_YEAR_03 RELEASE_YEAR_04 RELEASE_YEAR_05 RELEASE_YEAR_06 RELEASE_YEAR_07 RELEASE_YEAR_08 RELEASE_YEAR_09 RELEASE_YEAR_10 RELEASE_YEAR_11 RELEASE_YEAR_12 RELEASE_YEAR_13 RELEASE_YEAR_14 RELEASE_YEAR_15 RELEASE_YEAR_16 RELEASE_YEAR_17 RELEASE_YEAR_18 RELEASE_YEAR_19 RELEASE_YEAR_20 RELEASE_YEAR_21 RELEASE_YEAR_22 REGION_Europe REGION_NSAmerica REGION_AfricaME NUMBER_OF_SEASONS_x_NUMBER_OF_VOTES NUMBER_OF_VOTES_x_SCORE
-0.3293163 0.4752719 -0.8106196 2.6532162 documentary 0.2514778 0.1062688 -0.1102803 -0.2622438 -0.2603488 -0.1083494 0.1062453 0.2675052 0.2955184 0.1825521 -0.0141874 -0.2110746 -0.3410505 -0.3781936 -0.3367461 -0.2527514 -0.1626832 -0.0899621 -0.0423123 -0.0165073 0.0046333 -0.0023010 1 0 0 41386 384889.8
0.4372215 -0.0950544 -0.8106196 2.2200381 action 0.3143473 0.2975526 0.1929906 0.0367141 -0.1301744 -0.2708734 -0.3612340 -0.3933900 -0.3741137 -0.3194662 -0.2482792 -0.1767750 -0.1156103 -0.0694340 -0.0381949 -0.0191462 -0.0086753 -0.0035097 -0.0012441 -0.0003746 0.0000837 -0.0000374 0 1 0 175412 1596249.2
0.9291804 0.6653806 -0.4782471 1.1370927 crime 0.1886084 -0.0425075 -0.2371027 -0.2062064 0.0205539 0.2309552 0.2315029 0.0223611 -0.2106354 -0.2666222 -0.1004317 0.1517793 0.2957191 0.2308505 0.0080595 -0.2319204 -0.3677303 -0.3652133 -0.2697823 -0.1531004 0.0582812 -0.0339028 0 1 0 522858 2248289.4
0.3711405 -0.7287502 0.5188704 1.1370927 action 0.2200431 0.0265672 -0.1929906 -0.2636240 -0.1318872 0.1012211 0.2628173 0.2376904 0.0460817 -0.1807506 -0.2975617 -0.2381533 -0.0434299 0.1810493 0.3360453 0.3781316 0.3247118 0.2256041 0.1284866 0.0591874 -0.0190988 0.0101736 0 1 0 819290 1409178.8
-0.0793705 0.4752719 -0.4782471 1.1370927 action 0.2200431 0.0265672 -0.1929906 -0.2636240 -0.1318872 0.1012211 0.2628173 0.2376904 0.0460817 -0.1807506 -0.2975617 -0.2381533 -0.0434299 0.1810493 0.3360453 0.3781316 0.3247118 0.2256041 0.1284866 0.0591874 -0.0190988 0.0101736 0 0 0 170176 731756.8
0.1573236 0.7921198 0.5188704 0.9205036 action 0.1257389 -0.1487763 -0.2315887 -0.0115939 0.2260924 0.1740970 -0.1025284 -0.2529145 -0.0863916 0.1918775 0.2404388 -0.0016664 -0.2500834 -0.2173991 0.0623734 0.2897966 0.2326439 -0.0536990 -0.3232634 -0.3939626 0.2417697 -0.1843048 1 0 0 632365 1075020.5
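The RELEASE_YEAR_01 to RELEASE_YEAR_22 columns above are not 0/1 indicators. RELEASE_YEAR was stored as an ordered factor, and, as far as we understand the recipes default, `step_dummy()` encodes ordered factors with polynomial contrasts (linear, quadratic, and so on). The base R sketch below shows the same kind of encoding on a hypothetical four-level factor.

```r
# Toy ordered factor with four release years (hypothetical values)
year <- factor(c(2000, 2001, 2002, 2003), ordered = TRUE)

# Ordered factors use polynomial contrasts: linear, quadratic, cubic
poly_codes <- contr.poly(nlevels(year))
print(round(poly_codes, 4))

# An unordered factor would give 0/1 indicator columns instead
indicator_codes <- contr.treatment(levels(year))
print(indicator_codes)
```

Either encoding yields one fewer column than there are levels. Converting release year to an unordered factor would produce ordinary indicator columns if those are preferred.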

K-Fold Cross Validation

We will split our training data into five folds using k-fold cross-validation. This technique estimates the test error rate from the available training data: the observations are divided into k groups, or folds, of roughly equal size, and each fold takes a turn as the validation set while the model is fit on the remaining folds. Using k-fold cross-validation rather than evaluating our models on the entire training set helps us avoid overfitting to the training data and reduces the variance of the performance estimate.

movies_folds <- vfold_cv(movies_train, v=5)
shows_folds <- vfold_cv(shows_train, v=5)

movies_folds; shows_folds
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits           id   
##   <list>           <chr>
## 1 <split [208/53]> Fold1
## 2 <split [209/52]> Fold2
## 3 <split [209/52]> Fold3
## 4 <split [209/52]> Fold4
## 5 <split [209/52]> Fold5
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits           id   
##   <list>           <chr>
## 1 <split [144/36]> Fold1
## 2 <split [144/36]> Fold2
## 3 <split [144/36]> Fold3
## 4 <split [144/36]> Fold4
## 5 <split [144/36]> Fold5
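The fold mechanics described above can be sketched in base R. This is only an illustration of the idea, with a hypothetical row count `n`; `vfold_cv()` handles the bookkeeping for us.

```r
set.seed(131)

n <- 20  # hypothetical number of training rows
k <- 5

# Randomly assign each row to one of k roughly equal folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold takes one turn as the validation (assessment) set
sizes <- t(sapply(seq_len(k), function(fold) {
  c(fold = fold,
    analysis = sum(fold_id != fold),
    assessment = sum(fold_id == fold))
}))
print(sizes)
```

The analysis and assessment sizes play the same role as the pairs shown in the `<split [209/52]>` entries above.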

Building Models

Model Building Process

1. Set up the model by specifying the type of model and its parameters, and setting up the engine and mode.

We will be building a total of five models for both movies and shows:

  • K-nearest neighbors (tuning number of neighbors)

  • Elastic net regression (tuning mixture and penalty)

  • Pruned decision trees (tuning cost complexity)

  • Random forest (tuning the number of predictors, number of trees, and minimum number of observations in a node)

  • Gradient-boosted trees (tuning the number of predictors, number of trees, and the learning rate)

For each of the models, the mode is set to classification, since main genre is a categorical outcome.

# knn
movies_knn <- nearest_neighbor(neighbors=tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
shows_knn <- nearest_neighbor(neighbors=tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# elastic net multinomial regression
movies_en <- multinom_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
shows_en <- multinom_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# pruned decision trees
movies_tree <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")
shows_tree <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# random forest
movies_forest <- rand_forest(mtry = tune(),
                             trees = tune(),
                             min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
shows_forest <- rand_forest(mtry = tune(),
                            trees = tune(),
                            min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# gradient-boosted trees
movies_bt <- boost_tree(mtry = tune(),
                        trees = tune(),
                        learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
shows_bt <- boost_tree(mtry = tune(),
                        trees = tune(),
                        learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

2. Set up the workflow using the workflow() function and add the established model and recipe.

# knn
movies_knn_workflow <- workflow() %>%
  add_model(movies_knn) %>%
  add_recipe(movies_recipe)
shows_knn_workflow <- workflow() %>%
  add_model(shows_knn) %>%
  add_recipe(shows_recipe)

# elastic net multinomial regression
movies_en_workflow <- workflow() %>%
  add_model(movies_en) %>%
  add_recipe(movies_recipe)
shows_en_workflow <- workflow() %>%
  add_model(shows_en) %>%
  add_recipe(shows_recipe)

# pruned decision trees
movies_tree_workflow <- workflow() %>%
  add_model(movies_tree) %>%
  add_recipe(movies_recipe)
shows_tree_workflow <- workflow() %>%
  add_model(shows_tree) %>%
  add_recipe(shows_recipe)

# random forest
movies_forest_workflow <- workflow() %>%
  add_model(movies_forest) %>%
  add_recipe(movies_recipe)
shows_forest_workflow <- workflow() %>%
  add_model(shows_forest) %>%
  add_recipe(shows_recipe)

# gradient-boosted trees
movies_bt_workflow <- workflow() %>%
  add_model(movies_bt) %>%
  add_recipe(movies_recipe)
shows_bt_workflow <- workflow() %>%
  add_model(shows_bt) %>%
  add_recipe(shows_recipe)

3. Set up tuning grids for the parameters we want tuned, and specify the ranges as well as the number of levels we want.

# knn
movies_knn_grid <- grid_regular(neighbors(range=c(1, 10)), levels=10)
shows_knn_grid <- movies_knn_grid

# en
movies_en_grid <- grid_regular(penalty(range=c(0,3), 
                                trans = identity_trans()), 
                                mixture(range=c(0, 1)), levels=10)
shows_en_grid <- movies_en_grid

# pruned decision trees
movies_tree_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
shows_tree_grid <- movies_tree_grid

# random forest
movies_forest_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)), 
                        levels = 5)
shows_forest_grid <- movies_forest_grid

# gradient-boosted trees
movies_bt_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)
shows_bt_grid <- movies_bt_grid
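One detail worth spelling out: in dials, the ranges for `cost_complexity()` and `learn_rate()` are given in log10 units, so `c(-3, -1)` means 0.001 to 0.1. The penalty grid above is different because it was explicitly given an identity transformation. Assuming a regular grid is evenly spaced on the transformed scale, the candidate values can be sketched in base R:

```r
# Ten evenly spaced points on the log10 scale from -3 to -1
cost_complexity_values <- 10^seq(-3, -1, length.out = 10)
print(signif(cost_complexity_values, 3))

# Five evenly spaced points on the log10 scale from -10 to -1
learn_rate_values <- 10^seq(-10, -1, length.out = 5)
print(signif(learn_rate_values, 3))
```

Most of the learning-rate candidates are therefore extremely small, with 0.1 as the largest value in the grid.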

4. Tune each of the models using the workflow as the object, folds as the resamples, and created grids.

# knn
movies_knn_tune <- tune_grid(
  object = movies_knn_workflow,
  resamples = movies_folds,
  grid = movies_knn_grid
)
shows_knn_tune <- tune_grid(
  object = shows_knn_workflow,
  resamples = shows_folds,
  grid = shows_knn_grid
)

# elastic net multinomial regression
movies_en_tune <- tune_grid(
  object = movies_en_workflow,
  resamples = movies_folds,
  grid = movies_en_grid
)
shows_en_tune <- tune_grid(
  object = shows_en_workflow,
  resamples = shows_folds,
  grid = shows_en_grid
)

# pruned decision tree
movies_tree_tune <- tune_grid(
  object = movies_tree_workflow,
  resamples = movies_folds,
  grid = movies_tree_grid
)
shows_tree_tune <- tune_grid(
  object = shows_tree_workflow,
  resamples = shows_folds,
  grid = shows_tree_grid
)

# random forest
movies_forest_tune <- tune_grid(
  object = movies_forest_workflow,
  resamples = movies_folds,
  grid = movies_forest_grid
)
shows_forest_tune <- tune_grid(
  object = shows_forest_workflow,
  resamples = shows_folds,
  grid = shows_forest_grid
)

# gradient-boosted trees
movies_bt_tune <- tune_grid(
  object = movies_bt_workflow,
  resamples = movies_folds,
  grid = movies_bt_grid
)
shows_bt_tune <- tune_grid(
  object = shows_bt_workflow,
  resamples = shows_folds,
  grid = shows_bt_grid
)

Because tuning each of the models takes a long time, we will save the results after running them into RDA files so that we don’t have to rerun them every time.

save(movies_knn_tune, file="movies_knn_results.rda")
save(movies_en_tune, file="movies_en_results.rda")
save(movies_tree_tune, file="movies_tree_results.rda")
save(movies_forest_tune, file="movies_forest_results.rda")
save(movies_bt_tune, file="movies_bt_results.rda")

save(shows_knn_tune, file="shows_knn_results.rda")
save(shows_en_tune, file="shows_en_results.rda")
save(shows_tree_tune, file="shows_tree_results.rda")
save(shows_forest_tune, file="shows_forest_results.rda")
save(shows_bt_tune, file="shows_bt_results.rda")

5. Load the saved results back in to use for our analysis.

load(file="movies_knn_results.rda")
load(file="movies_en_results.rda")
load(file="movies_tree_results.rda")
load(file="movies_forest_results.rda")
load(file="movies_bt_results.rda")

load(file="shows_knn_results.rda")
load(file="shows_en_results.rda")
load(file="shows_tree_results.rda")
load(file="shows_forest_results.rda")
load(file="shows_bt_results.rda")
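The save and load steps can also be combined so that the expensive call runs only when no saved file exists yet. The sketch below shows one way to do this in base R. `slow_tune()`, `run_or_load()`, and the temporary file are hypothetical stand-ins for illustration, not part of the project code.

```r
# Hypothetical stand-in for an expensive tuning call
slow_tune <- function() {
  Sys.sleep(0.1)
  data.frame(config = 1:3, value = c(1, 2, 3))
}

# Temporary file used only for this illustration
cache_file <- file.path(tempdir(), "toy_knn_results.rda")
unlink(cache_file)  # start from a clean state

run_or_load <- function(path) {
  if (file.exists(path)) {
    load(path)  # restores toy_knn_tune into this function's environment
  } else {
    toy_knn_tune <- slow_tune()
    save(toy_knn_tune, file = path)
  }
  toy_knn_tune
}

first <- run_or_load(cache_file)   # computes and saves
second <- run_or_load(cache_file)  # loads from disk
print(identical(first, second))
```

With this pattern, deleting the file is enough to force a fresh tuning run.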

6. Collect metrics of the tuned models.

movies_knn_metrics <- collect_metrics(movies_knn_tune)
movies_en_metrics <- collect_metrics(movies_en_tune)
movies_tree_metrics <- collect_metrics(movies_tree_tune)
movies_forest_metrics <- collect_metrics(movies_forest_tune)
movies_bt_metrics <- collect_metrics(movies_bt_tune)

shows_knn_metrics <- collect_metrics(shows_knn_tune)
shows_en_metrics <- collect_metrics(shows_en_tune)
shows_tree_metrics <- collect_metrics(shows_tree_tune)
shows_forest_metrics <- collect_metrics(shows_forest_tune)
shows_bt_metrics <- collect_metrics(shows_bt_tune)

Model Results

We have collected metrics from our model results, so now it is finally time to compare them and see which model was the best fit for our dataset. The performance is measured by the area under the ROC curve (ROC AUC), which measures the overall performance of our classifiers. A higher AUC means a better performance.
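With more than two genres, the reported ROC AUC is a multiclass generalization (the tuning output later in this report lists the estimator as hand_till), which, as we understand it, averages AUC over pairs of classes. The two-class building block can be sketched in base R using the rank-sum identity. The `binary_auc()` helper and the scores are hypothetical, for illustration only.

```r
# Binary AUC via the rank-sum (Mann-Whitney) identity
binary_auc <- function(scores, is_positive) {
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities for the positive class
scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

auc <- binary_auc(scores, is_positive)
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation, which is a helpful yardstick for reading the values below.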

Let’s look at the plotted results from our models. The autoplot function in R allows us to visualize the result of each tuned parameter in our models.

K-nearest neighbors

autoplot(movies_knn_tune)

autoplot(shows_knn_tune)

We tuned our k-nearest neighbors models at 10 levels from 1 to 10 neighbors. For movies, the highest ROC AUC was 0.538 with k = 3. For shows, the highest ROC AUC was 0.653 with k = 2.

Elastic Net

autoplot(movies_en_tune)

autoplot(shows_en_tune)

We tuned our elastic net models at 10 levels each of penalty and mixture. For movies, our best ROC AUC was 0.681 with penalty = 0 and mixture = 0. For shows, our best ROC AUC was 0.661, also with penalty = 0 and mixture = 0. For both datasets, these are the highest ROC AUC values of any model so far.

Decision Tree

autoplot(movies_tree_tune)

autoplot(shows_tree_tune)

For our decision tree model, we tuned 10 levels of cost complexity from -3 to -1 on the log10 scale (0.001 to 0.1). Our best ROC AUC for movies was 0.6 with a cost complexity of 0.0359. This did better than our knn model, but not as well as the elastic net model. The highest ROC AUC for shows was 0.611 with a cost complexity of 0.0599. This did not do as well as the other two models.

Random Forest

autoplot(movies_forest_tune)

autoplot(shows_forest_tune)

For our random forest model, we tuned 5 levels each of the number of predictors from 1 to 6, the number of trees from 200 to 600, and the minimum number of data points per node from 10 to 20. The highest ROC AUC was 0.63 for movies with mtry = 6, trees = 400, and min_n = 17, and 0.637 for shows with mtry = 4, trees = 200, and min_n = 20. Both are better than the decision tree results, but neither beats the elastic net model.

Boosted Trees

autoplot(movies_bt_tune)

autoplot(shows_bt_tune)

We tuned 5 levels each of the number of predictors from 1 to 6, the number of trees from 200 to 600, and the learning rate from -10 to -1 on the log10 scale for our boosted trees models. The highest ROC AUC for movies was 0.625 with mtry = 6, trees = 300, and learn_rate = 0.1, just behind the random forest model. For shows it was 0.629 with mtry = 2, trees = 600, and learn_rate = 0.1.

Best Model

Here is a visualization of the highest ROC AUC produced by each of our models.

movies_knn_highest <- bind_cols(arrange(movies_knn_metrics[movies_knn_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "K-nearest neighbors")
movies_en_highest <- bind_cols(arrange(movies_en_metrics[movies_en_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Elastic Net")
movies_tree_highest <- bind_cols(arrange(movies_tree_metrics[movies_tree_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Pruned Decision Tree")
movies_forest_highest <- bind_cols(arrange(movies_forest_metrics[movies_forest_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Random Forest")
movies_bt_highest <- bind_cols(arrange(movies_bt_metrics[movies_bt_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Boosted Decision Tree")

shows_knn_highest <- bind_cols(arrange(shows_knn_metrics[shows_knn_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "K-nearest neighbors")
shows_en_highest <- bind_cols(arrange(shows_en_metrics[shows_en_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Elastic Net")
shows_tree_highest <- bind_cols(arrange(shows_tree_metrics[shows_tree_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Pruned Decision Tree")
shows_forest_highest <- bind_cols(arrange(shows_forest_metrics[shows_forest_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Random Forest")
shows_bt_highest <- bind_cols(arrange(shows_bt_metrics[shows_bt_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Boosted Decision Tree")

movies_results <- bind_rows(movies_knn_highest, movies_en_highest, movies_tree_highest, movies_forest_highest, movies_bt_highest)
colnames(movies_results) <- c("ROC_AUC", "Model")
shows_results <- bind_rows(shows_knn_highest, shows_en_highest, shows_tree_highest, shows_forest_highest, shows_bt_highest)
colnames(shows_results) <- c("ROC_AUC", "Model")

movies_results %>%
  ggplot(aes(x=Model, y=ROC_AUC)) + 
  geom_col(fill="#ff9896") + 
  geom_text(aes(label = round(ROC_AUC, 3)), vjust = -0.5) + 
  ylim(0, 1) + 
  theme_hc() + 
  labs(y = "ROC AUC", title = "Comparing ROC AUC for Movies")

shows_results %>%
  ggplot(aes(x=Model, y=ROC_AUC)) + 
  geom_col(fill="#c49c94") + 
  geom_text(aes(label = round(ROC_AUC, 3)), vjust = -0.5) + 
  ylim(0, 1) + 
  theme_hc() + 
  labs(y = "ROC AUC", title = "Comparing ROC AUC for Shows")

Once again, the elastic net model resulted in the highest ROC AUC value for both the movies and shows datasets. For both, the selected penalty and mixture happen to be 0. This is the model we will evaluate on our testing datasets.

show_best(movies_en_tune, metric="roc_auc")[1,] %>% kable()
penalty mixture .metric .estimator mean n std_err .config
0 0 roc_auc hand_till 0.6805623 2 0.0095567 Preprocessor1_Model001
show_best(shows_en_tune, metric="roc_auc")[1,] %>% kable()
penalty mixture .metric .estimator mean n std_err .config
0 0 roc_auc hand_till 0.661077 5 0.0275367 Preprocessor1_Model001
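To see what a penalty and mixture of 0 imply, the sketch below computes the elastic net penalty term under the parameterisation used by glmnet, which we assume is the engine behind the elastic net model here. The function name `elastic_net_penalty` and the coefficient vector `beta` are hypothetical and exist only for this illustration.

```r
# Elastic net penalty term (glmnet-style parameterisation, assumed here):
#   penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
elastic_net_penalty <- function(beta, penalty, mixture) {
  ridge_part <- (1 - mixture) / 2 * sum(beta^2)  # L2 part, weight (1 - mixture)
  lasso_part <- mixture * sum(abs(beta))         # L1 part, weight mixture
  penalty * (ridge_part + lasso_part)
}

beta <- c(0.5, -1.2, 2.0)  # made-up coefficients, not from our fitted models

no_shrinkage <- elastic_net_penalty(beta, penalty = 0,   mixture = 0)  # selected values
pure_ridge   <- elastic_net_penalty(beta, penalty = 0.1, mixture = 0)
pure_lasso   <- elastic_net_penalty(beta, penalty = 0.1, mixture = 1)

print(c(no_shrinkage = no_shrinkage, pure_ridge = pure_ridge, pure_lasso = pure_lasso))
```

With a penalty of 0 the term vanishes whatever the mixture is, so the selected model behaves like a multinomial regression with no shrinkage on its coefficients.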

Before fitting the model to our testing sets, we will finalize the elastic net workflow using our best model, then fit it to our entire training dataset.

movies_best <- select_best(movies_en_tune, metric="roc_auc")
movies_final_workflow <- finalize_workflow(movies_en_workflow, movies_best)
movies_final_fit <- fit(movies_final_workflow, movies_train)

shows_best <- select_best(shows_en_tune, metric="roc_auc")
shows_final_workflow <- finalize_workflow(shows_en_workflow, shows_best)
shows_final_fit <- fit(shows_final_workflow, shows_train)

And finally, we can apply it to our testing sets and look at how it performs on new data.

Final Model Performance

movies_final_test <- augment(movies_final_fit, new_data=movies_test) %>%
  dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test$MAIN_GENRE <- factor(movies_final_test$MAIN_GENRE)
roc_auc(movies_final_test, truth=MAIN_GENRE,
        .pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
  kable()
.metric .estimator .estimate
roc_auc hand_till 0.515081
shows_final_test <- augment(shows_final_fit, new_data=shows_test) %>%
  dplyr::select(MAIN_GENRE, starts_with(".pred"))
shows_final_test$MAIN_GENRE <- factor(shows_final_test$MAIN_GENRE)
roc_auc(shows_final_test, truth=MAIN_GENRE,
        .pred_action:.pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_scifi:.pred_war) %>%
  kable()
.metric .estimator .estimate
roc_auc hand_till 0.7508291

The ROC AUC value of our model for movies was 0.515, and the value for shows was 0.751. Evidently, our model did not do well on our movies dataset: a value this close to 0.5 is barely better than random guessing. It might have overfitted to our training set, where its cross-validated ROC AUC was 0.681, resulting in a lower ROC AUC for our testing set. On the other hand, our model did well for shows. We can say that our model is able to predict shows better than it is able to predict movies.
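The `hand_till` estimator shown in the output tables is a multiclass version of ROC AUC (Hand and Till, 2001): it averages a pairwise AUC over every pair of genres. The base-R sketch below illustrates the idea on made-up data. It is not the yardstick implementation, and the names `auc_binary`, `hand_till_auc`, `perfect`, and `uninformative` are hypothetical.

```r
# AUC for one class pair via the Mann-Whitney rank statistic: the chance that
# a randomly chosen "positive" row gets a higher score than a "negative" row.
auc_binary <- function(score_pos, score_neg) {
  r <- rank(c(score_pos, score_neg))  # average ranks, so ties count as 0.5
  n_pos <- length(score_pos)
  n_neg <- length(score_neg)
  (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hand-Till style multiclass AUC: average the symmetric pairwise AUC over
# all unordered pairs of classes. `probs` has one named column per class.
hand_till_auc <- function(truth, probs) {
  pairs <- combn(colnames(probs), 2)
  pair_auc <- apply(pairs, 2, function(p) {
    i <- p[1]
    j <- p[2]
    a_ij <- auc_binary(probs[truth == i, i], probs[truth == j, i])
    a_ji <- auc_binary(probs[truth == j, j], probs[truth == i, j])
    (a_ij + a_ji) / 2
  })
  mean(pair_auc)
}

# Toy data (not from the Netflix datasets): six titles, three genres.
truth   <- c("comedy", "comedy", "drama", "drama", "crime", "crime")
classes <- c("comedy", "crime", "drama")

# Predictions that always give the true genre the highest probability.
perfect <- matrix(0.1, nrow = 6, ncol = 3, dimnames = list(NULL, classes))
perfect[cbind(seq_along(truth), match(truth, classes))] <- 0.8

# Predictions that give every genre the same probability.
uninformative <- matrix(1 / 3, nrow = 6, ncol = 3, dimnames = list(NULL, classes))

print(hand_till_auc(truth, perfect))        # 1
print(hand_till_auc(truth, uninformative))  # 0.5
```

The second case shows why 0.5 is the natural baseline when reading the test results: a model that cannot tell the genres apart scores 0.5.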

Because the cross-validated ROC AUC values of the different models were quite similar on our training dataset, it might be worth exploring the results of some of the other models on our testing dataset for movies to see if they can produce a higher ROC AUC. In particular, we are interested in the random forest and boosted decision tree models. We will use the same steps as before: finalize the workflow, fit it to the training data, and evaluate it on the testing data.

movies_best_forest <- select_best(movies_forest_tune, metric="roc_auc")
movies_final_workflow_forest <- finalize_workflow(movies_forest_workflow, movies_best_forest)
movies_final_fit_forest <- fit(movies_final_workflow_forest, movies_train)

movies_final_test_forest <- augment(movies_final_fit_forest, new_data=movies_test) %>%
  dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test_forest$MAIN_GENRE <- factor(movies_final_test_forest$MAIN_GENRE)
roc_auc(movies_final_test_forest, truth=MAIN_GENRE,
        .pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
  kable()
.metric .estimator .estimate
roc_auc hand_till 0.5654177
movies_best_bt <- select_best(movies_bt_tune, metric="roc_auc")
movies_final_workflow_bt <- finalize_workflow(movies_bt_workflow, movies_best_bt)
movies_final_fit_bt <- fit(movies_final_workflow_bt, movies_train)

movies_final_test_bt <- augment(movies_final_fit_bt, new_data=movies_test) %>%
  dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test_bt$MAIN_GENRE <- factor(movies_final_test_bt$MAIN_GENRE)
roc_auc(movies_final_test_bt, truth=MAIN_GENRE,
        .pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
  kable()
.metric .estimator .estimate
roc_auc hand_till 0.5011232

Interestingly, the random forest model fits the movies testing dataset somewhat better, with an ROC AUC of 0.565, while the boosted trees model reached only 0.501. The random forest result is still not as high as the ROC AUC for shows, but it is an improvement over the elastic net model.

Why is this happening?

The elastic net model likely produced a lower ROC AUC on our movies testing dataset because it overfitted to the training data. With a selected penalty of 0, the model applies no regularization, which makes overfitting more likely. We also did not have many predictors in our movies dataset, which may have limited how well the elastic net model could predict.
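One rough way to quantify this is to compare the cross-validated ROC AUC with the testing ROC AUC for the elastic net model. The sketch below uses the values printed in the output tables above; the data frame `auc_summary` and its column names are made up for this illustration.

```r
# Elastic net ROC AUC values copied from the printed output tables:
# cross-validated mean on the training data and the estimate on the testing data.
auc_summary <- data.frame(
  dataset  = c("movies", "shows"),
  cv_auc   = c(0.6805623, 0.6610770),
  test_auc = c(0.5150810, 0.7508291)
)

# A negative gap means the model did worse on new data than cross-validation
# suggested, which is consistent with (but does not prove) overfitting.
auc_summary$gap <- auc_summary$test_auc - auc_summary$cv_auc

print(auc_summary)
```

For movies the gap is clearly negative, while for shows the testing value is actually higher than the cross-validated one. With datasets this small, part of either gap may simply be noise from the train/test split.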

Variable Importance

Let’s take a look at the variable importance graphs using the vip function. These tell us which predictors were the most important in determining the genre of a movie or show. For movies, using the boosted trees fit, the duration, number of votes, score, and the interaction between the number of votes and score were the most important. For shows, using the elastic net fit, the region and release year take up the top spots of the chart.

movies_final_fit_bt %>%
  extract_fit_parsnip() %>%
  vip()

shows_final_fit %>%
  extract_fit_parsnip() %>%
  vip()

Conclusion

Through this project, we learned that the best predictors for a movie’s genre are the duration, number of votes, and score of the movie on IMDb, and the best predictors for a show’s genre are the region and release year. Surprisingly, the predictors are very different. After fitting multiple models to both our datasets and comparing them on the testing sets, we come to the conclusion that the best model for movies is the random forest model and the best for shows is the elastic net model. However, both models, especially in predicting movies, have much room for improvement.

One of our main issues was that we did not have many predictor variables to start with. Given more, our models might have turned out a lot more accurate. It might also be worth looking into other models such as Naive Bayes and Support Vector Machines. Having a larger dataset with more observations would also help our models. We were trying to predict a factor with many levels, and some of these levels only contained a few observations. If there were more data for each main genre, our models would be better trained.

Sources

The data were taken from the Kaggle dataset “Netflix TV Shows and Movies”.